Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
SparkSession.builder
.master("local")
.appName("Section 2 - Performing your First Transformations")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
|   | id | breed_id | nickname | birthday            | age | color |
|---|----|----------|----------|---------------------|-----|-------|
| 0 | 1  | 1        | King     | 2014-11-22 12:30:31 | 5   | brown |
| 1 | 2  | 3        | Argus    | 2016-11-22 10:05:10 | 10  | None  |
| 2 | 3  | 1        | Chewie   | 2016-11-22 10:05:10 | 15  | None  |
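Note that `spark.read.csv` with only `header=True` reads every column as a string. As a side note, a minimal sketch of supplying an explicit schema instead (the column names and types here are assumptions based on the data shown above) could look like this:

pets_schema = T.StructType([
    T.StructField('id', T.IntegerType(), True),
    T.StructField('breed_id', T.IntegerType(), True),
    T.StructField('nickname', T.StringType(), True),
    T.StructField('birthday', T.TimestampType(), True),
    T.StructField('age', T.IntegerType(), True),
    T.StructField('color', T.StringType(), True),
])

# read the same file, but with typed columns instead of all strings
pets_typed = spark.read.csv(path, header=True, schema=pets_schema)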
Transformation
(
pets
.withColumn('birthday_date', F.col('birthday').cast('date'))
.withColumn('owned_by', F.lit('me'))
.withColumnRenamed('id', 'pet_id')
.where(F.col('birthday_date') > datetime(2015,1,1))
).toPandas()
|   | pet_id | breed_id | nickname | birthday            | age | color | birthday_date | owned_by |
|---|--------|----------|----------|---------------------|-----|-------|---------------|----------|
| 0 | 2      | 3        | Argus    | 2016-11-22 10:05:10 | 10  | None  | 2016-11-22    | me       |
| 1 | 3      | 1        | Chewie   | 2016-11-22 10:05:10 | 15  | None  | 2016-11-22    | me       |
What Happened?
- We renamed the primary key of our df from `id` to `pet_id`.
- We truncated the precision of our date types (timestamp to date).
- We filtered our dataset to a smaller subset (pets born after 2015-01-01).
- We created a new column describing who owns these pets (each of these steps is sketched individually below).
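For reference, here is a sketch of the same chain broken into one transformation per statement, using the `pets` DataFrame read in above; the intermediate variable names are chosen purely for illustration:

pets_dated = pets.withColumn('birthday_date', F.col('birthday').cast('date'))    # timestamp -> date
pets_owned = pets_dated.withColumn('owned_by', F.lit('me'))                      # add a constant column
pets_renamed = pets_owned.withColumnRenamed('id', 'pet_id')                      # rename the primary key
pets_recent = pets_renamed.where(F.col('birthday_date') > datetime(2015, 1, 1))  # keep pets born after 2015-01-01
pets_recent.toPandas()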
Summary
We performed a variety of Spark transformations on our data; we will go through these transformations in detail in the following section.